A Few More Data Wrangling Tools

STAT 331

A few words about drop_na()

  • Easy tool to remove missing values
  • By default, removes any row with a missing value in any variable
  • But you can specify which columns it should check for missing values!
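
A minimal sketch of the two behaviors above, assuming the starwars data from dplyr (the choice of homeworld and mass is only for illustration):

```r
library(dplyr)   # starwars data
library(tidyr)   # drop_na()

# No columns named: drop every row with an NA in any column
complete_rows <- starwars |> 
  drop_na()

# Columns named: drop only the rows where homeworld or mass is NA
known_home_mass <- starwars |> 
  drop_na(homeworld, mass)

# Compare how many rows survive each version
nrow(starwars)
nrow(complete_rows)
nrow(known_home_mass)
```

Naming columns usually keeps far more rows, since a row is no longer discarded for a missing value in a variable you are not using.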

Summarizing Frequencies

count() – counts the number of rows for each value (or combination of values) of one or more categorical variables

starwars |> 
  count(homeworld)

The sort argument specifies whether the resulting tibble should be sorted in descending order of the counts (n). The default is sort = FALSE.

starwars |> 
  count(homeworld, 
        sort = TRUE)
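
count() also accepts more than one variable. A small sketch, assuming we want counts for each homeworld / species combination (the pairing is just an example):

```r
library(dplyr)

# One row per homeworld / species combination, largest groups first
species_by_home <- starwars |> 
  count(homeworld, 
        species, 
        sort = TRUE)

species_by_home
```

The result has one column per counted variable plus n, and missing values are counted as their own group.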

Finding Unique Groups

distinct() – selects the unique / distinct rows from a dataset

Arguments

  • ... – variables to use when determining uniqueness
    • can use multiple!
  • .keep_all – decides if all of the columns should be kept, or only the ones used to determine uniqueness
    • FALSE is the default!
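
A short sketch of these arguments, assuming the starwars data (the object names are made up for the example):

```r
library(dplyr)

# One variable: one row per unique homeworld, only that column is returned
home_only <- starwars |> 
  distinct(homeworld)

# Multiple variables: one row per unique homeworld / species combination
home_species <- starwars |> 
  distinct(homeworld, species)

# .keep_all = TRUE: keep every column, using the first row found per homeworld
first_per_home <- starwars |> 
  distinct(homeworld, .keep_all = TRUE)
```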

Discretizing Variables

  • if_else()
    • Useful when there are two options
  • case_when()
    • Useful when there are three or more options
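
A sketch of both functions inside mutate(), assuming the starwars data. The height cutoffs (150, 180, 200 cm) are arbitrary values chosen for illustration.

```r
library(dplyr)

heights <- starwars |> 
  select(name, height) |> 
  mutate(
    # Two options: one condition, a value if TRUE and a value if FALSE
    tall = if_else(height > 180, "tall", "not tall"),
    # Three or more options: conditions are checked in order, first match wins
    height_group = case_when(
      height < 150 ~ "short",
      height <= 200 ~ "medium",
      height > 200 ~ "tall"
    )
  )

heights
```

Characters with a missing height get NA in both new columns, since no condition evaluates to TRUE for them.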

What if I want to perform the same operation across multiple columns?

across()

makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside “data-masking” functions like summarise() and mutate()


across(.cols = everything(), .fns = NULL, ...)
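
across() works the same way inside mutate(). A minimal sketch, assuming we want the three color columns of starwars stored as factors (the column choice is only an example):

```r
library(dplyr)

starwars_fct <- starwars |> 
  mutate(
    across(
      .cols = c(hair_color, skin_color, eye_color), 
      .fns = as.factor
      )
    )

# The selected columns are replaced in place; no new columns are added
starwars_fct |> 
  select(name, hair_color, skin_color, eye_color)
```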

Summarizing Multiple Columns

starwars |> 
  summarise(
    across(
      height:mass, 
      # anonymous function: passing na.rm through ... is deprecated as of dplyr 1.1.0
      \(x) mean(x, na.rm = TRUE)
      )
    )
# A tibble: 1 × 2
  height  mass
   <dbl> <dbl>
1   174.  97.3

Conditional Summarizing

starwars |> 
  summarise(
    across(
      where(is.numeric), 
      # anonymous function: passing na.rm through ... is deprecated as of dplyr 1.1.0
      \(x) mean(x, na.rm = TRUE)
      )
    )
# A tibble: 1 × 3
  height  mass birth_year
   <dbl> <dbl>      <dbl>
1   174.  97.3       87.6

❤️ |>

starwars |> 
  drop_na(homeworld) |> 
  filter(gender == "feminine") |>
  ggplot(mapping = aes(x = homeworld, fill = homeworld)) + 
  geom_bar(position = "dodge") + 
  labs(title = "Homeworlds of Feminine Star Wars Characters", 
       x = "") + 
  theme(legend.position = "none")